[LoadStoreVectorizer] Propagate alignment through contiguous chain #145733
base: main
Conversation
… improve vectorization
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers

Author: Drew Kersnar (dakersnar)

Changes

At this point in the vectorization pass, we are guaranteed to have a contiguous chain with defined offsets for each element. Using this information, we can derive and upgrade alignment for elements in the chain based on their offset from previous well-aligned elements. This enables vectorization of chains that are longer than the maximum vector length of the target. This algorithm is also robust to the head of the chain not being well-aligned; if we find a better alignment while iterating from the beginning to the end of the chain, we will use that alignment moving forward.

Full diff: https://github.com/llvm/llvm-project/pull/145733.diff

2 Files Affected:
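To illustrate the arithmetic with hypothetical offsets (commonAlignment is the existing helper from llvm/Support/Alignment.h that the patch uses; the function name below is purely illustrative):

#include "llvm/Support/Alignment.h"
#include <cassert>

// If one chain element is known to be 16-byte aligned, an element 16 bytes
// further along inherits align 16, while an element 20 bytes along can only
// be assumed 4-byte aligned (greatest common power-of-two divisor).
void alignmentPropagationExample() {
  llvm::Align BestAlignSoFar(16);
  assert(llvm::commonAlignment(BestAlignSoFar, 16) == llvm::Align(16));
  assert(llvm::commonAlignment(BestAlignSoFar, 20) == llvm::Align(4));
}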
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 89f63c3b66aad..e14a936b764e5 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -343,6 +343,9 @@ class Vectorizer {
/// Postcondition: For all i, ret[i][0].second == 0, because the first instr
/// in the chain is the leader, and an instr touches distance 0 from itself.
std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+ /// Propagates the best alignment in a chain of contiguous accesses
+ void propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const;
};
class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -716,6 +719,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
+ // We know that the accesses are contiguous. Propagate alignment
+ // information so that slices of the chain can still be vectorized.
+ propagateBestAlignmentsInChain(C);
+ LLVM_DEBUG({
+ dbgs() << "LSV: Chain after alignment propagation:\n";
+ dumpChain(C);
+ });
+
std::vector<Chain> Ret;
for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
// Find candidate chains of size not greater than the largest vector reg.
@@ -1634,3 +1645,27 @@ std::optional<APInt> Vectorizer::getConstantOffset(Value *PtrA, Value *PtrB,
.sextOrTrunc(OrigBitWidth);
return std::nullopt;
}
+
+void Vectorizer::propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const {
+ ChainElem BestAlignedElem = C[0];
+ Align BestAlignSoFar = getLoadStoreAlignment(C[0].Inst);
+
+ for (const ChainElem &E : C) {
+ Align OrigAlign = getLoadStoreAlignment(E.Inst);
+ if (OrigAlign > BestAlignSoFar) {
+ BestAlignedElem = E;
+ BestAlignSoFar = OrigAlign;
+ }
+
+ APInt OffsetFromBestAlignedElem =
+ E.OffsetFromLeader - BestAlignedElem.OffsetFromLeader;
+ assert(OffsetFromBestAlignedElem.isNonNegative());
+ // commonAlignment is equivalent to a greatest common power-of-two divisor;
+ // it returns the largest power of 2 that divides both A and B.
+ Align NewAlign = commonAlignment(
+ BestAlignSoFar, OffsetFromBestAlignedElem.getLimitedValue());
+ if (NewAlign > OrigAlign)
+ setLoadStoreAlignment(E.Inst, NewAlign);
+ }
+ return;
+}
diff --git a/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll b/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll
new file mode 100644
index 0000000000000..a1878dc051d99
--- /dev/null
+++ b/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll
@@ -0,0 +1,296 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -passes=load-store-vectorizer -S < %s | FileCheck %s
+
+; The IR has the first float3 labeled with align 16, and that 16 should
+; be propagated such that the second set of 4 values
+; can also be vectorized together.
+%struct.float3 = type { float, float, float }
+%struct.S1 = type { %struct.float3, %struct.float3, i32, i32 }
+
+define void @testStore(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStore(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S1:%.*]], ptr [[TMP0]], i64 0, i32 1, i32 1
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store float 0.000000e+00, ptr %1, align 16
+ %getElem = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 1
+ store float 0.000000e+00, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 2
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 1
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 2
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 2
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 3
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoad(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoad(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[L11:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <4 x float> [[TMP2]], i32 1
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP2]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S1:%.*]], ptr [[TMP0]], i64 0, i32 1, i32 1
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x i32> [[TMP3]], i32 0
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32 [[L55]] to float
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x i32> [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L66]] to float
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP3]], i32 2
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP3]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l1 = load float, ptr %1, align 16
+ %getElem = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 1
+ %l2 = load float, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 2
+ %l3 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1
+ %l4 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 1
+ %l5 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 2
+ %l6 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 2
+ %l7 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 3
+ %l8 = load i32, ptr %getElem13, align 4
+ ret void
+}
+
+; Also, test without the struct geps, to see if it still works with i8 geps/ptradd
+
+define void @testStorei8(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStorei8(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 16
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store float 0.000000e+00, ptr %1, align 16
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ store float 0.000000e+00, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 8
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 12
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 16
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 20
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 24
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 28
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoadi8(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoadi8(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[L11:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <4 x float> [[TMP2]], i32 1
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP2]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 16
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x i32> [[TMP3]], i32 0
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32 [[L55]] to float
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x i32> [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L66]] to float
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP3]], i32 2
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP3]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l1 = load float, ptr %1, align 16
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ %l2 = load float, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 8
+ %l3 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 12
+ %l4 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 16
+ %l5 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 20
+ %l6 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 24
+ %l7 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 28
+ %l8 = load i32, ptr %getElem13, align 4
+ ret void
+}
+
+
+; This version of the test adjusts the struct to hold two i32s at the beginning,
+; but still assumes that the first float3 is 16 aligned. If the alignment
+; propagation works correctly, it should be able to load this struct in three
+; loads: a 2x32, a 4x32, and a 4x32. Without the alignment propagation, the last
+; 4x32 will instead be a 2x32 and a 2x32
+%struct.S2 = type { i32, i32, %struct.float3, %struct.float3, i32, i32 }
+
+define void @testStore_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStore_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <2 x i32> zeroinitializer, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds [[STRUCT_S2:%.*]], ptr [[TMP0]], i64 0, i32 2
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S2]], ptr [[TMP0]], i64 0, i32 3, i32 1
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store i32 0, ptr %1, align 8
+ %getElem = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 1
+ store i32 0, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2
+ store float 0.000000e+00, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 1
+ store float 0.000000e+00, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 2
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 1
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 2
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 4
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 5
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoad_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoad_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <2 x i32>, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[L1:%.*]] = extractelement <2 x i32> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <2 x i32> [[TMP2]], i32 1
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds [[STRUCT_S2:%.*]], ptr [[TMP0]], i64 0, i32 2
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x float>, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP3]], i32 0
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP3]], i32 1
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x float> [[TMP3]], i32 2
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x float> [[TMP3]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S2]], ptr [[TMP0]], i64 0, i32 3, i32 1
+; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L77]] to float
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32 [[L88]] to float
+; CHECK-NEXT: [[L99:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
+; CHECK-NEXT: [[L010:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l = load i32, ptr %1, align 8
+ %getElem = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 1
+ %l2 = load i32, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2
+ %l3 = load float, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 1
+ %l4 = load float, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 2
+ %l5 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3
+ %l6 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 1
+ %l7 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 2
+ %l8 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 4
+ %l9 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 5
+ %l0 = load i32, ptr %getElem13, align 4
+ ret void
+}
+
+; Also, test without the struct geps, to see if it still works with i8 geps/ptradd
+
+define void @testStorei8_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStorei8_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <2 x i32> zeroinitializer, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 8
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 24
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store i32 0, ptr %1, align 8
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ store i32 0, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds i8, ptr %1, i64 8
+ store float 0.000000e+00, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds i8, ptr %1, i64 12
+ store float 0.000000e+00, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 16
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 20
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 24
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 28
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 32
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 36
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoadi8_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoadi8_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <2 x i32>, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[L1:%.*]] = extractelement <2 x i32> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <2 x i32> [[TMP2]], i32 1
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 8
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x float>, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP3]], i32 0
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP3]], i32 1
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x float> [[TMP3]], i32 2
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x float> [[TMP3]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 24
+; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L77]] to float
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32 [[L88]] to float
+; CHECK-NEXT: [[L99:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
+; CHECK-NEXT: [[L010:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l = load i32, ptr %1, align 8
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ %l2 = load i32, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds i8, ptr %1, i64 8
+ %l3 = load float, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds i8, ptr %1, i64 12
+ %l4 = load float, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 16
+ %l5 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 20
+ %l6 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 24
+ %l7 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 28
+ %l8 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 32
+ %l9 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 36
+ %l0 = load i32, ptr %getElem13, align 4
+ ret void
+}
Nice
LGTM, but please wait for review by someone with more LSV experience before landing.
Adding @michalpaszkowski to reviewers since I saw you reviewed a recently merged LSV change. If you can think of anyone else who would be a better fit to review this let me know :)
%l7 = load i32, ptr %getElem12, align 8
%getElem13 = getelementptr inbounds i8, ptr %1, i64 28
%l8 = load i32, ptr %getElem13, align 4
ret void
Can you use more representative tests with some uses? As-is, these all optimize to a memset or a no-op.
Slight pushback: my understanding is that lit tests are most useful when they are minimal reproducers of the problem the optimization is targeting. Adding uses would not really change the nature of this optimization. Tests like llvm/test/Transforms/LoadStoreVectorizer/NVPTX/vectorize_i1.ll follow this thinking.
If you think it would be better, I could combine each pair of load and store tests into individual tests, storing the result of the loads. Other LSV tests use that pattern a lot.
; The IR has the first float3 labeled with align 16, and that 16 should
; be propagated such that the second set of 4 values
; can also be vectorized together.
Is this papering over a missed middle end optimization? The middle end should have inferred the pointer argument to align 16, and then each successive access should already have a refined alignment derived from that
I think this is nice to have for a variety of edge cases, but the specific motivator for this change is based on InstCombine's handling of nested structs. You are correct that if loads/stores are unpacked from aligned loads/stores of aggregates, alignment will be propagated to the unpacked elements. However, with nested structs, InstCombine unpacks one layer at a time, losing alignment context in between passes over the worklist.
Before IC:
; RUN: opt < %s -passes=instcombine -S | FileCheck %s
%struct.S1 = type { %struct.float3, %struct.float3, i32, i32 }
%struct.float3 = type { float, float, float }
define void @_Z7init_s1P2S1S0_(ptr noundef %v, ptr noundef %vout) {
%v.addr = alloca ptr, align 8
%vout.addr = alloca ptr, align 8
store ptr %v, ptr %v.addr, align 8
store ptr %vout, ptr %vout.addr, align 8
%tmp = load ptr, ptr %vout.addr, align 8
%tmp1 = load ptr, ptr %v.addr, align 8
%1 = load %struct.S1, ptr %tmp1, align 16
store %struct.S1 %1, ptr %tmp, align 16
ret void
}
After IC:
define void @_Z7init_s1P2S1S0_(ptr noundef %v, ptr noundef %vout) {
%.unpack.unpack = load float, ptr %v, align 16
%.unpack.elt7 = getelementptr inbounds nuw i8, ptr %v, i64 4
%.unpack.unpack8 = load float, ptr %.unpack.elt7, align 4
%.unpack.elt9 = getelementptr inbounds nuw i8, ptr %v, i64 8
%.unpack.unpack10 = load float, ptr %.unpack.elt9, align 8
%.elt1 = getelementptr inbounds nuw i8, ptr %v, i64 12
%.unpack2.unpack = load float, ptr %.elt1, align 4
%.unpack2.elt12 = getelementptr inbounds nuw i8, ptr %v, i64 16
%.unpack2.unpack13 = load float, ptr %.unpack2.elt12, align 4 ; <----------- this should be align 16
%.unpack2.elt14 = getelementptr inbounds nuw i8, ptr %v, i64 20
%.unpack2.unpack15 = load float, ptr %.unpack2.elt14, align 4
%.elt3 = getelementptr inbounds nuw i8, ptr %v, i64 24
%.unpack4 = load i32, ptr %.elt3, align 8
%.elt5 = getelementptr inbounds nuw i8, ptr %v, i64 28
%.unpack6 = load i32, ptr %.elt5, align 4
store float %.unpack.unpack, ptr %vout, align 16
%vout.repack23 = getelementptr inbounds nuw i8, ptr %vout, i64 4
store float %.unpack.unpack8, ptr %vout.repack23, align 4
%vout.repack25 = getelementptr inbounds nuw i8, ptr %vout, i64 8
store float %.unpack.unpack10, ptr %vout.repack25, align 8
%vout.repack17 = getelementptr inbounds nuw i8, ptr %vout, i64 12
store float %.unpack2.unpack, ptr %vout.repack17, align 4
%vout.repack17.repack27 = getelementptr inbounds nuw i8, ptr %vout, i64 16
store float %.unpack2.unpack13, ptr %vout.repack17.repack27, align 4
%vout.repack17.repack29 = getelementptr inbounds nuw i8, ptr %vout, i64 20
store float %.unpack2.unpack15, ptr %vout.repack17.repack29, align 4
%vout.repack19 = getelementptr inbounds nuw i8, ptr %vout, i64 24
store i32 %.unpack4, ptr %vout.repack19, align 8
%vout.repack21 = getelementptr inbounds nuw i8, ptr %vout, i64 28
store i32 %.unpack6, ptr %vout.repack21, align 4
ret void
}
To visualize what's happening under the hood, InstCombine is unpacking the load in stages like this:
%1 = load %struct.S1, ptr %tmp1, align 16
->
load struct.float3 align 16
load struct.float3 align 4
load i32 align 8
load i32 align 4
->
load float align 16
load float align 4
load float align 8
load float align 4
load float align 4
load float align 4
load i32 align 8
load i32 align 4
If InstCombine is fixed to propagate the alignment properly, will this patch still be useful/needed? If not, then I would agree with Matt that we should try fixing the source of the problem.
This InstCombine case is the only one I currently know of that benefits from this specific optimization, but I would suspect there could be a different way to arrive at a pattern like this.
But let's ignore those hypotheticals, because I understand the desire to stick within the realm of known use cases. The question is, are we comfortable changing the unpackLoadToAggregate algorithm in InstCombine to recurse through nested structs?
llvm-project/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
Lines 788 to 801 in 8c32f95
for (uint64_t i = 0; i < NumElements; i++) {
  Value *Indices[2] = {
      Zero,
      ConstantInt::get(IdxType, i),
  };
  auto *Ptr = IC.Builder.CreateInBoundsGEP(AT, Addr, ArrayRef(Indices),
                                           Name + ".elt");
  auto EltAlign = commonAlignment(Align, Offset.getKnownMinValue());
  auto *L = IC.Builder.CreateAlignedLoad(AT->getElementType(), Ptr,
                                         EltAlign, Name + ".unpack");
  L->setAAMetadata(LI.getAAMetadata());
  V = IC.Builder.CreateInsertValue(V, L, i);
  Offset += EltSize;
}
I feel like implementing that would be a bit of a mess, and leads to questions like "how many layers deep should it recurse"? Unpacking one layer at a time and then adding the nested struct elements to the IC worklist to be independently operated on in a later iteration is a much cleaner solution and feels more in line with the design philosophy of InstCombine. I could be wrong though, open to feedback or challenges to my assumptions.
And just to be clear, in case my previous explanation was confusing, the reason it would have to recurse through all layers at once is that we cannot store the knowledge that "the second element of load struct.float3 align 4 is aligned to 16" on the instruction; there isn't a syntax available to express that.
Edit: We would also have to change unpackStoreToAggregate to do this too, further increasing the code complexity
This has the align 16 with opt -attributor-enable=module -O3. The default attribute passes do a worse job, it seems.
Analysis implies that we collect the information, but we're not making any actual changes to the IR.
Collecting the chains would fit that description. Collecting the chains and applying new alignment would become a transformation.
Yeah, you're totally right. What I meant was that, in this case, the alignment-upgrading transformation is a bit of an implementation hack to carry the computed alignment through to the end of the pass, at which point it is assigned to the newly created vectorized load. Theoretically we could instead add an "alignment" field to each ChainElem in the chain data structure and upgrade the alignment in memory without touching the IR until the final vectorized load/store is created (do we theoretically like this idea more?). Just wanted to point out this detail in case it changes the equation.
I appreciate your patience answering my questions and entertaining my half-baked ideas and hypothetical scenarios.
Totally, this is worth thinking through!
I don't think this needs to be expensive? I'd expect the implementation in InferAlignments would basically look like this: Walk the block and for all accesses on constant offset GEPs remember the largest implied alignment for the base (without offset) and use that to increase alignment on future accesses with the same base.
This is interesting. I can explore this idea to see how feasible it is. I think if we wanted something that could perform as well as this LSV optimization, it would need to call getUnderlyingObject and getConstantOffset for each load/store. I can look into the time complexity of that, but my hunch is it would still end up O(n^2).
The problem is that I think if we want to avoid those expensive calls and limit the analysis to "shallow" geps that offset directly off of aligned base pointers, we will miss a lot of cases. There are transformations like StraightLineStrengthReduce that build chains of geps that increase the instruction depth needed to find the underlying object.
what I meant was in this case I think the alignment upgrading transformation is a bit of an implementation hack to carry the computed alignment through to the end of the pass, at which point it will be assigned to the newly created vectorized load.
I think it may be another way to say what I said above -- we've established that the change is sensible, it just may not be the best place/time for it to be done in LSV.
So, yes, at the moment the upgraded alignment is only used by LSV, but that's exactly the reason why it may not be the best place for it. Having better (or just known) alignment info is generally helpful and we want to make it visible to other passes, independently of LSV. E.g. there may be cases where LSV would not be able to help much, but we could still improve lowering of non-vectorized loads/stores.
Having better (or just known) alignment info is generally helpful and we want to make it visible to other passes, independently of LSV.
Are there any existing passes in particular that would benefit from this upgraded alignment? If there is nothing currently existing that would take advantage of it, I lean towards moving forward with this improvement in the LSV for now and considering generalizing it in the future.
I think the compile time tradeoffs that come with implementing it in InstCombine or InferAlignments would be not worth it. For InstCombine specifically, we have a lot of practical issues with the compile time stress it puts on our compiler at NVIDIA, so we tend to avoid adding complexity to it when possible.
Are there any existing passes in particular that would benefit from this upgraded alignment?
Isel benefits from better alignment information in general.
I think the compile time tradeoffs that come with implementing it in InstCombine or InferAlignments would be not worth it. For InstCombine specifically, we have a lot of practical issues with the compile time stress it puts on our compiler at NVIDIA, so we tend to avoid adding complexity to it when possible.
We should definitely not infer better alignments in InstCombine. We specifically split the InferAlignment pass out of InstCombine to reduce the compile time spent repeatedly inferring alignment.
I think with the scheme I proposed above doing this in InferAlignment should be close to free, but of course we won't know until we try.
The problem is that I think if we want to avoid those expensive calls and limit the analysis to "shallow" geps that offset directly off of aligned base pointers, we will miss a lot of cases. There are transformations like StraightLineStrengthReduce that build chains of geps that increase the instruction depth needed to find the underlying object.
I'm not sure we'll be losing much in practice. I'd generally expect that after CSE, cases where alignment can be increased in this manner will have the form of constant offset GEPs from a common base. All of the tests you added in this PR are of this form. Did you see motivating cases where this does not hold?
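For concreteness, here is a rough sketch (not the actual InferAlignment implementation) of the walk proposed above: remember the best alignment implied for each GEP base pointer within a block and use it to raise later accesses off the same base. It is simplified to a single block and one level of constant-offset GEPs; getLoadStorePointerOperand, getLoadStoreAlignment, setLoadStoreAlignment, and commonAlignment are existing LLVM utilities, while everything else (including the function name) is hypothetical.

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Operator.h"
#include "llvm/Support/Alignment.h"

using namespace llvm;

static void raiseAlignmentsFromCommonBase(BasicBlock &BB, const DataLayout &DL) {
  // Best alignment known to hold at offset 0 of each base pointer.
  DenseMap<Value *, Align> BestBaseAlign;

  for (Instruction &I : BB) {
    Value *Ptr = getLoadStorePointerOperand(&I);
    if (!Ptr)
      continue; // not a load or store

    // Split the pointer into (Base, constant byte offset) if possible.
    Value *Base = Ptr;
    APInt Offset(DL.getIndexTypeSizeInBits(Ptr->getType()), 0);
    if (auto *GEP = dyn_cast<GEPOperator>(Ptr)) {
      if (!GEP->accumulateConstantOffset(DL, Offset))
        continue;
      Base = GEP->getPointerOperand();
    }
    uint64_t Off = Offset.abs().getLimitedValue();
    Align Known = getLoadStoreAlignment(&I);

    // Use what we already know about the base to raise this access.
    auto It = BestBaseAlign.find(Base);
    if (It != BestBaseAlign.end()) {
      Align FromBase = commonAlignment(It->second, Off);
      if (FromBase > Known) {
        setLoadStoreAlignment(&I, FromBase);
        Known = FromBase;
      }
    }

    // This access in turn implies an alignment for the base itself
    // (gcd of the access alignment and its offset, as powers of two).
    Align ImpliedForBase = commonAlignment(Known, Off);
    auto Ins = BestBaseAlign.try_emplace(Base, ImpliedForBase);
    if (!Ins.second && ImpliedForBase > Ins.first->second)
      Ins.first->second = ImpliedForBase;
  }
}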
PropagateAlignments(C);
PropagateAlignments(reverse(C));
Why does this need to go through twice? I'd expect each load and store to have started with an optimally computed alignment
test_forward_and_reverse in the test file demonstrates why going backwards could theoretically be useful. If you have a well-aligned element later in the chain, propagating the alignment backwards could improve earlier elements in the chain.
I haven't thought of a specific end-to-end test that would trigger this, but I think it can't hurt to include.
What if the alignment info at the beginning/end of the chain conflicts? Which one should be believed?
I think we should only propagate alignment info in one direction, forward, as it's the alignment of the base pointer that's the ground truth there.
What if the alignment info at the beginning/end of the chain conflict?
I think this isn't possible if the input IR is correct, right? For example, if one piece of analysis proves that a load is aligned to 4, and another proves it is aligned to 16, they are both correct, it's just that the 16 is a more precise result, and thus we should prefer it as it gives us more information.
I think we should only propagate alignment info in one direction, forward, as it's the alignment of the base pointer that's the ground truth there.
I can't really argue against that from a practicality perspective, as I haven't come up with a use case for the reverse propagation. But it would be correct, as the alignment argument on any load/store in the input IR has to be accurate, right? https://llvm.org/docs/LangRef.html#load-instruction for reference: "It is the responsibility of the code emitter to ensure that the alignment information is correct. Overestimating the alignment results in undefined behavior. Underestimating the alignment may produce less efficient code."
I'm not against reverse propagation of the alignment; I'm just not convinced that it will buy us much on top of the forward propagation. If I'm wrong, it's trivial to add it.
I also suspect that it will have corner cases. E.g. something like this:
Imagine that we have multiple nested function calls passing a semi-opaque structure pointer around. The intermediate functions only know about the structure header, and the leaf functions know about the opaque data that follows. The top-level caller guarantees that the pointer is properly aligned for the leaf function consumption, but intermediate layers do not.
Let's assume that we have two leaf functions f_aligned and f_default. f_aligned casts the pointer to the opaque blob to a type aligned by 16 and does a load.
Back-propagation would infer alignment for the access in the intermediate functions (let's assume they all get inlined, so we see all memory accesses).
The f_default function applies no explicit alignment and issues a load with no alignment set, so we can't infer anything from it. However, the caller also does not do anything extra about the alignment of the pointer it passes along, so it's not guaranteed to have high alignment.
Before inferring alignment, the IR is valid for calls to both f_aligned and f_default.
However, after we reverse-propagate alignment info, the IR will be valid only for the code path that involves f_aligned.
I'm pretty much in the same camp actually. I added it based on another review, but we really don't have a clear motivator for it. I'm fine with leaving it out until we find such a motivator.
This was my comment. Here is my understanding and how I've thought about this. A chain of loads/stores is a group of these operations where we know they all have fixed offsets relative to each other (and since we're considering vectorizing, they all have the same control dependency). The insight of this change is that the offsets can be used in combination with the alignments already on these memory operations to deduce better alignments for other elements in the chain. We're doing this by finding the operation with the greatest alignment and then upgrading the alignments of other operations based on their offset relative to this best-aligned operation.
The question is whether we should apply this deduction to all elements or restrict it to only elements which have addresses greater than the most-aligned-operation. I think it's clearest conceptually and potentially improves perf to just apply the deduction to all members of the chain. Limiting to only items with greater addresses isn't going to buy us much in terms of reducing complexity or compile time (even if these are the only cases we'd need to address the IC multi-level struct issue).
Will we propagate align 16 to load of gep32?
What will be the alignment of 'load base' ?
What if the order of loads of gep_32 and gep_16 changes? We should end up with the same final alignment regardless of it. With the code iterating over the chain linearly, it may have trouble dealing with propagation across sibling branches.
I think we should apply align 16 to all 3 loads.
I'm not quite sure if it's literally a chain from the leaf to the root, or actually a graph with all known pointers derived from the same base.
So in this specific case, this chain would be split earlier in the algorithm in splitChainByContiguity, because there are memory gaps between the loads. But for the sake of your question let's ignore that.
The chain would be a list of elements and their offset from the base chain element. It's never a graph. Loads/stores that are not able to be boiled down to this format are tossed away. So your example would look like this:
{Inst = load i32, base, Offset = 0}
{Inst = load i32, gep_16, align 16, Offset = 16}
{Inst = load i32, gep_32, align 4, Offset = 32}
What if the order of loads of gep_32 and gep_16 changes? We should end up with the same final alignment regardless of it. With the code iterating over the chain linearly, it may have trouble dealing with propagation across sibling branches.
The basic block order does not matter to the algorithm, at least not after it proves there are no data dependencies in splitChainByMayAliasInstrs. After that point, it sorts the chain by Offset order, so any permutation of the original IR would deterministically end up as the same chain.
With the code iterating over the chain linearly, it may have trouble dealing with propagation across sibling branches.
This would be true if it was represented as a graph, but it is not. It is a list of tuples, each tuple containing an instruction and offset. Each instruction has been proven to have the same underlying object and all that matters for the algorithm is the offset from the "head" of the chain, which is often pointing directly to that underlying object.
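For reference, the chain element type in the pass is roughly just an instruction plus its constant byte distance from the chain leader; the sketch below is an approximation based on the fields used in the diff (the exact SmallVector alias is assumed):

struct ChainElem {
  Instruction *Inst;       // the load or store
  APInt OffsetFromLeader;  // constant byte offset from the chain's first element
};
using Chain = SmallVector<ChainElem, 1>;  // approximate; see LoadStoreVectorizer.cpp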
I agree with Alex's framing of this solution; thank you for your explanation.
I think it's clearest conceptually and potentially improving perf to just apply the deduction to all members of the chain. Limiting to only items with greater addresses isn't going to buy us much in terms of reducing complexity or compile time
I think this is quite reasonable. Rather than "why should we propagate backwards", "why shouldn't we propagate backwards" is potentially a better question, because there is a risk of overfitting the solution to the use case if we don't answer that question.
Thank you for the explanation.
So, in this case, the combination of forward and backward propagation is equivalent to finding the largest alignment at a known offset and then applying that known alignment to the other operations in the chain. The forward pass applies the alignment to the operations that follow the max-alignment point, and the reverse pass applies it to the preceding ones.
We may be doing somewhat unnecessary work (we will be applying smaller alignment until we reach the max point and we'll eventually apply the max alignment on the reverse pass), but it should be valid.
Perhaps it would make sense to separate finding the max alignment from upgrading the alignment. Finding the max alignment could be constrained to finding a large-enough value, so we can stop the search early, as we don't need alignments larger than the size of the largest aligned load. It also makes the forward/backward distinction moot.
as we don't need alignments larger than the size of the largest aligned load
Actually, I think we would want to search for alignments up to the memory size of the chain, and at that point it's probably not worth including an early exit case, as I don't think we have enough context at this point to build an easy heuristic.
However, I agree that the algorithm you are suggesting is more intuitive/readable and saves some calls to setMaxAlignment with the same number of iterations, even without the early exit. I just updated it and I think it's improved.
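For illustration, a sketch of the separated find-then-apply shape discussed above, reusing the pass's existing ChainElem, OffsetFromLeader, and the getLoadStoreAlignment/setLoadStoreAlignment helpers. This is a sketch only, not the exact code that landed after this review.

void Vectorizer::propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const {
  // Phase 1: find the element whose explicit alignment is largest.
  const ChainElem *Best = &C[0];
  Align BestAlign = getLoadStoreAlignment(C[0].Inst);
  for (const ChainElem &E : C) {
    Align A = getLoadStoreAlignment(E.Inst);
    if (A > BestAlign) {
      Best = &E;
      BestAlign = A;
    }
  }

  // Phase 2: every element sits at a known constant distance from Best, so it
  // is aligned to at least commonAlignment(BestAlign, |distance|). This covers
  // elements both before and after Best, making a separate reverse pass moot.
  for (const ChainElem &E : C) {
    APInt Dist = E.OffsetFromLeader - Best->OffsetFromLeader;
    Align NewAlign = commonAlignment(BestAlign, Dist.abs().getLimitedValue());
    if (NewAlign > getLoadStoreAlignment(E.Inst))
      setLoadStoreAlignment(E.Inst, NewAlign);
  }
}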
I still have the feeling this is beyond the scope of what the vectorizer should be doing. Attributor / functionattrs + instcombine should be dealing with this
@arsenm Just in case my message gets buried in the other thread, I'll put it here too: I'd love to look closer at this, can you share details about your repro/commands? Thank you :)